19 research outputs found

    Fine-Grained Scheduling for Containerized HPC Workloads in Kubernetes Clusters

    Containerization technology offers lightweight OS-level virtualization and enables portability, reproducibility, and flexibility by packaging applications with low performance overhead and little effort to maintain and scale them. Moreover, container orchestrators (e.g., Kubernetes) are widely used in the Cloud to manage large clusters running many containerized applications. However, scheduling policies that consider the performance nuances of containerized High Performance Computing (HPC) workloads have not been well explored yet. This paper proposes fine-grained scheduling policies for containerized HPC workloads in Kubernetes clusters, focusing especially on partitioning each job into a suitable multi-container deployment according to the application profile. We implement our scheduling schemes on different layers of management (application and infrastructure), so that each component has its own focus and algorithms but still collaborates with the others. Our results show that our fine-grained scheduling policies outperform the baseline policy and the baseline policy with CPU/memory affinity enabled, reducing the overall response time by 35% and 19%, respectively, and also improving the makespan by 34% and 11%, respectively. They also provide better usability and flexibility for specifying HPC workloads than other comparable HPC Cloud frameworks, while providing better scheduling efficiency thanks to their multi-layered approach. Comment: HPCC202
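The core partitioning idea above can be made concrete with a small sketch. This is purely illustrative and not the paper's implementation; the function name and the one-container-per-NUMA-domain policy are our assumptions:

```python
# Hypothetical sketch: partition an HPC job's processes into containers,
# one container per NUMA domain, with a pinned core range for each.
# All names here are illustrative, not the paper's actual scheduler.

def partition_job(n_processes: int, numa_domains: int, cores_per_domain: int):
    """Split a job into per-NUMA-domain containers with pinned core ranges.

    Returns a list of (numa_domain, cpuset) tuples, where cpuset is a
    'start-end' core range string suitable for a cpuset-style pin.
    """
    per_container = -(-n_processes // numa_domains)  # ceiling division
    plan = []
    assigned = 0
    for domain in range(numa_domains):
        if assigned >= n_processes:
            break
        count = min(per_container, n_processes - assigned)
        start = domain * cores_per_domain
        end = start + count - 1
        plan.append((domain, f"{start}-{end}"))
        assigned += count
    return plan

# Example: a 16-process job on a 2-socket node with 8 cores per socket
print(partition_job(16, 2, 8))
```

A scheduler following this scheme would then map each tuple to one container of the multi-container deployment.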

    Performance characterization of containerization for HPC workloads on InfiniBand clusters: an empirical study

    Containerization technology offers an appealing alternative for encapsulating and running applications (along with all their dependencies) without the performance penalties of Virtual Machines and, as a result, has attracted the interest of the High-Performance Computing (HPC) community as a means to obtain fast, customized, portable, flexible, and reproducible deployments of their workloads. Previous work in this area has demonstrated that containerized HPC applications can exploit InfiniBand networks, but has ignored the potential of multi-container deployments, which partition the processes belonging to each application into multiple containers on each host. Partitioning HPC applications has proven useful with virtual machines by constraining each to a single NUMA (Non-Uniform Memory Access) domain. This paper conducts a systematic study of the performance of multi-container deployments with different network fabrics and protocols, focusing especially on InfiniBand networks. We analyze the impact of container granularity and its potential to exploit processor and memory affinity to improve applications' performance. Our results show that default Singularity can achieve near bare-metal performance but does not support fine-grained multi-container deployments. Docker and Singularity-instance behave similarly in terms of the performance of deployment schemes with different container granularity and affinity. This behavior differs across network fabrics and protocols, and depends as well on the application communication patterns and the message size. Moreover, deployments on InfiniBand are also more affected by computation and memory allocation and, because of that, can exploit affinity better. We thank Lenovo for providing the testbed used to run the experiments in this paper.
    This work was partially supported by Lenovo as part of the Lenovo-BSC collaboration agreement, by the Spanish Government under contract PID2019-107255GB-C22, and by the Generalitat de Catalunya under contract 2017-SGR-1414 and under Grant No. 2020 FI-B 00257.
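For concreteness, the processor and memory affinity used in such multi-container deployments can be expressed with Docker's `--cpuset-cpus` and `--cpuset-mems` flags (the flags themselves are real Docker options; the helper function and image name below are our illustrative assumptions, not the paper's tooling):

```python
# Illustrative sketch (not the paper's tooling): build `docker run` commands
# that pin each container of a multi-container deployment to one NUMA node
# using Docker's --cpuset-cpus (core range) and --cpuset-mems (memory node).

def docker_affinity_cmds(image: str, plan):
    """plan: list of (numa_node, cpuset_string) pairs, one per container."""
    cmds = []
    for node, cpus in plan:
        cmds.append(
            f"docker run -d --cpuset-cpus={cpus} --cpuset-mems={node} {image}"
        )
    return cmds

# Two containers, each pinned to the cores and memory of one NUMA node
for cmd in docker_affinity_cmds("hpc-app:latest", [(0, "0-7"), (1, "8-15")]):
    print(cmd)
```

Singularity instances would need an external pinning mechanism (e.g. `numactl`) instead, since they share the host's cgroup setup by default.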

    Performance characterization of multi-container deployment schemes for online learning inference

    Online machine learning (ML) inference services provide users with an interactive way to request predictions in real time. To meet the notable computational requirements of such services, they are increasingly being deployed in the Cloud. In this context, the efficient provisioning and optimization of ML inference services in the Cloud is critical to achieve the required performance and serve the dynamic queries of end-users. Existing provisioning solutions focus on framework parameter tuning and infrastructure resource scaling, without considering deployments based on containerization technologies. The latter promise reproducibility and portability for ML inference services. There is limited knowledge about the impact of distinct container-level deployment schemes on the performance of online ML inference services, particularly on how to exploit multi-container deployments and their relation to processor and memory affinity. In light of this, in this paper we experimentally investigate the containerization of ML inference services and analyze the performance of multi-container deployments that partition the threads belonging to an online learning application into multiple containers on each node. This paper shares the findings and lessons learned from running realistic client patterns against an image classification model across numerous deployment configurations, including in particular the impact of container granularity and its potential to exploit processor and memory affinity. Our results indicate that fine-grained multi-container deployments and affinity are useful for improving performance (both throughput and latency).
    In particular, our experiments on single-node and four-node clusters show up to 69% and 87% performance improvement, respectively, compared to the single-container deployment. This work was partially supported by Lenovo as part of the Lenovo-BSC collaboration agreement, by the Spanish Government under contract PID2019-107255GB-C22, and by the Generalitat de Catalunya under contract 2021-SGR-00478 and under grant 2020 FI-B 00257.

    Scanflow-K8s: agent-based framework for autonomic management and supervision of ML workflows in Kubernetes clusters

    Machine Learning (ML) projects are currently heavily based on workflows composed of reproducible steps and executed as containerized pipelines to build or deploy ML models efficiently, thanks to the flexibility, portability, and fast delivery they provide to the ML life-cycle. However, deployed models need to be watched and constantly managed, supervised, and debugged to guarantee their availability, validity, and robustness in unexpected situations. Therefore, containerized ML workflows would benefit from leveraging flexible and diverse autonomic capabilities. This work presents an architecture for autonomic ML workflows with abilities for multi-layered control, based on an agent-based approach that enables autonomic management and supervision of ML workflows at the application layer and the infrastructure layer (by collaborating with the orchestrator). We redesign the Scanflow ML framework to support such a multi-agent approach by using triggers, primitives, and strategies. We also implement a practical platform, called Scanflow-K8s, that enables autonomic ML workflows on Kubernetes clusters based on the Scanflow agents. MNIST image classification and MLPerf ImageNet classification benchmarks are used as case studies to show the capabilities of Scanflow-K8s under different scenarios. The experimental results demonstrate the feasibility and effectiveness of our proposed agent approach and of the Scanflow-K8s platform for the autonomic management of ML workflows in Kubernetes clusters at multiple layers. This work was supported by Lenovo as part of the Lenovo-BSC 2020 collaboration agreement, by the Spanish Government under contract PID2019-107255GB-C22, and by the Generalitat de Catalunya under contract 2017-SGR-1414 and under grant 2020 FI-B 00257.

    Scanflow: an end-to-end agent-based autonomic ML workflow manager for clusters

    Machine Learning (ML) is more than just training models; the whole life-cycle must be considered. Once deployed, a ML model needs to be constantly managed, supervised, and debugged to guarantee its availability, validity, and robustness in dynamic contexts. This demonstration presents an agent-based ML workflow manager, called Scanflow, which enables autonomic management and supervision of the end-to-end life-cycle of ML workflows on distributed clusters. The case study on a MNIST project shows that different teams can collaborate using Scanflow within a ML project at different phases, and demonstrates the effectiveness of the agents in maintaining the model accuracy and the throughput of the model serving while running in production. This work was partially supported by Lenovo as part of the Lenovo-BSC 2020 collaboration agreement, by the Spanish Government under contract PID2019-107255GB-C22, and by the Generalitat de Catalunya under contract 2017-SGR-1414 and under grant 2020 FI-B 00257.

    Human-in-the-loop online multi-agent approach to increase trustworthiness in ML models through trust scores and data augmentation

    Increasing a ML model's accuracy is not enough; we must also increase its trustworthiness. This is an important step towards building resilient AI systems for safety-critical applications such as automotive, finance, and healthcare. For that purpose, we propose a multi-agent system that combines both machine and human agents. In this system, a checker agent calculates a trust score for each instance (which penalizes overconfidence in predictions) using an agreement-based method and ranks it; then an improver agent filters the anomalous instances based on a human rule-based procedure (which is considered safe), gets the human labels, applies geometric data augmentation, and retrains with the augmented data using transfer learning. We evaluate the system on corrupted versions of the MNIST and FashionMNIST datasets. We obtain an improvement in accuracy and trust score with just a few additional labels compared to a baseline approach. This work was supported by Lenovo as part of the Lenovo-BSC 2020 collaboration agreement, by the Spanish Government under contracts PID2019-107255GB-C21 and PID2019-107255GB-C22, and by the Generalitat de Catalunya under contract 2017-SGR-1414 and under grant 2020 FI-B 00257.
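One plausible reading of an agreement-based trust score that penalizes overconfidence is sketched below. This is our hedged interpretation of the general idea, not the paper's exact formula; the function name and the penalty term are our assumptions:

```python
# Hedged sketch of an agreement-based trust score (our reading of the idea,
# not the paper's exact method): trust is high when ensemble members agree,
# and confidence exceeding the agreement level is treated as overconfidence.
from collections import Counter

def trust_score(ensemble_labels, model_confidence):
    """ensemble_labels: labels predicted by several models for one instance.
    model_confidence: the serving model's softmax confidence in [0, 1]."""
    majority_label, votes = Counter(ensemble_labels).most_common(1)[0]
    agreement = votes / len(ensemble_labels)
    # Penalize overconfidence: confidence above the agreement level is suspect.
    overconfidence = max(0.0, model_confidence - agreement)
    return max(0.0, agreement - overconfidence)

# Full agreement with matching confidence yields maximal trust;
# a split ensemble paired with a very confident model yields low trust.
print(trust_score([3, 3, 3, 3], 1.0))
print(trust_score([3, 3, 7, 7], 0.95))
```

Instances ranked low by such a score would be the ones handed to the improver agent and, ultimately, to the human in the loop.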

    An Extract of Antrodia camphorata Mycelia Attenuates the Progression of Nephritis in Systemic Lupus Erythematosus-Prone NZB/W F1 Mice

    Antrodia camphorata is used in folk medicine in Taiwan for the treatment of inflammation syndromes and liver-related diseases. The goal of this study was to evaluate the efficacy of the mycelial extract of A. camphorata (ACE) for the treatment of systemic lupus erythematosus (SLE) in SLE-prone NZB/W F1 mice. After antibodies against double-stranded DNA appeared in NZB/W mice, the mice were orally administered varying dosages of ACE (100, 200 and 400 mg kg−1) for 5 consecutive days per week for 12 weeks via gavage. To assess the efficacy of ACE, we measured SLE-associated biochemical and histopathological biomarkers: levels of blood urea nitrogen (BUN), blood creatinine, urine protein and urine creatinine, and the thickness of the kidney glomerular basement membrane as assessed by periodic acid-Schiff staining. Antroquinonol, an active component of ACE, was investigated for anti-inflammatory activity in lipopolysaccharide-induced RAW 264.7 cells. ACE at 400 mg kg−1 significantly suppressed urine protein and serum BUN levels and decreased the thickness of the kidney glomerular basement membrane. Antroquinonol significantly inhibited the production of tumor necrosis factor-α and interleukin-1β by 75 and 78%, respectively. In conclusion, ACE reduced urine protein and creatinine levels and suppressed the thickening of the kidney glomerular basement membrane, suggesting that ACE protects the kidney from immunological damage resulting from autoimmune disease.

    An architecture for automatic ML/AI workflow management and supervision

    Scientific computation problems increasingly need to analyze large amounts of data as part of their application workflows, and science-based models are being combined with big data and machine learning models to solve complex problems and phenomena [1][2]. A machine learning workflow is composed of reproducible steps that can be executed as a pipeline to build a model efficiently, saving iteration time and helping with debugging and error detection [3]. Currently, businesses and researchers are investigating and improving methodologies for developing and deploying machine learning workflows in both the training and inference phases, which helps the data science team focus on their requirements and the data engineering team deploy and operate machine learning workflows efficiently and automatically [4]. This work presents an architecture for automatic machine learning workflows, which provides capabilities for monitoring and automatic management of the end-to-end life-cycle of machine learning workflows, including tracking and observing at the training stage, and releasing, monitoring, deployment, auto-detecting, and infrastructure management at the inference stage. To validate its feasibility, we conducted a case study based on our architecture, deployed it in the cloud, and demonstrated its automation.

    Performance comparison of multi-container deployment schemes for HPC workloads: an empirical study

    The high-performance computing (HPC) community has recently started to use containerization to obtain fast, customized, portable, flexible, and reproducible deployments of its workloads. Previous work showed that deploying an HPC workload into a single container can retain bare-metal performance. However, there is a lack of research on multi-container deployments that partition the processes belonging to each application into different containers. Partitioning HPC applications has been shown to improve their performance on virtual machines by allowing affinity to a non-uniform memory access (NUMA) domain to be set for each of them. Consequently, it is essential to understand the performance implications of distinct multi-container deployment schemes for HPC workloads, focusing on the impact of container granularity and its combination with processor and memory affinity. This paper presents a systematic performance comparison and analysis of multi-container deployment schemes for HPC workloads on a single-node platform, considering different containerization technologies (including Docker and Singularity), two platform architectures (UMA and NUMA), and two application subscription modes (exact subscription and over-subscription). Our results indicate that finer-grained multi-container deployments can, on the one hand, benefit the performance of some applications with low inter-process communication, especially in over-subscribed scenarios and when combined with affinity, but, on the other hand, can incur some performance degradation for communication-intensive applications when using containerization technologies that deploy isolated network namespaces. This work was partially supported by Lenovo as part of the Lenovo-BSC collaboration agreement, by the Spanish Government under contract PID2019-107255GB-C22, and by the Generalitat de Catalunya under contract 2017-SGR-1414 and under grant 2020 FI-B 00257.
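The container-granularity axis compared in this study can be made concrete with a small sketch that enumerates power-of-two deployment schemes for an n-process job. The enumeration policy and names below are ours, not the paper's:

```python
# Illustrative sketch of the container-granularity axis: enumerate deployment
# schemes that split an n-process job into 1, 2, 4, ... containers, from the
# coarsest (one container holding all processes) to the finest (one process
# per container). Names and the power-of-two policy are our assumptions.

def granularity_schemes(n_processes: int):
    """Return a list of (containers, processes_per_container) schemes."""
    schemes = []
    containers = 1
    while containers <= n_processes:
        if n_processes % containers == 0:  # only even splits
            schemes.append((containers, n_processes // containers))
        containers *= 2
    return schemes

# Example: deployment schemes for a 16-process job
print(granularity_schemes(16))
```

Each scheme would then be combined with an affinity setting (or none) to form one experimental configuration of the kind the comparison evaluates.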